Abstract


We present a novel approach for automatic report generation from
time-series data, in the context of student feedback generation. Our
proposed methodology treats content selection as a multi-label
classification (MLC) problem, which takes as input time-series data
(students’ learning data) and outputs a summary of these data
(feedback). Unlike previous work, this method considers all data
simultaneously using ensembles of classifiers, and therefore achieves
higher accuracy and F-score than meaningful baselines.


1 Introduction 


Summarisation of time-series data refers to the task of automatically
generating text from variables whose values change over time. We
consider the task of automatically generating feedback summaries for
students describing their semester-long performance during the lab of a
Computer Science module. Nine learning factors have been identified
which contribute to students’ learning: (1) marks, (2) hours_studied,
(3) understandability, (4) difficulty, (5) deadlines, (6) health_issues,
(7) personal_issues, (8) lectures_attended and (9) revision (Gkatzia et
al., 2013).


Gkatzia et al.’s analysis (2013) showed that 
there are 4 ways to refer to a learning factor: 

1. <trend>: describing the trend, 

2. <weeks>: describing what happened at every time stamp,

3. <average>: mentioning the average, or 

4. <other>: making another general statement. 

The task of content selection for feedback generation can be formulated
as a classification task as follows: given a set of 9 learning factors,
select the content that is most appropriate to be included in a summary.
Content is represented by templates.

Figure 1: Feedback generation as a binary classification problem without
history.


A template is defined as a quadruple consisting of an id, a factor, a
reference type (trend, weeks, average, other) and surface text.
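The quadruple can be pictured as a small record type. The sketch below is only illustrative: the class name `Template`, the field names, and the example id and surface text are assumptions made here, not details of the original system.

```python
from dataclasses import dataclass

# The four reference types named in the text.
REFERENCE_TYPES = ("trend", "weeks", "average", "other")


@dataclass(frozen=True)
class Template:
    """A content-selection template: (id, factor, reference type, surface text)."""
    id: int
    factor: str
    reference_type: str
    surface_text: str

    def __post_init__(self):
        # Reject reference types outside the four listed above.
        if self.reference_type not in REFERENCE_TYPES:
            raise ValueError(f"unknown reference type: {self.reference_type}")


# Hypothetical example; the id and the wording are invented for illustration.
example = Template(
    id=1,
    factor="marks",
    reference_type="trend",
    surface_text="Your marks increased over the semester.",
)
print(example.factor, example.reference_type)
```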

Overall, for all factors there are 29 different templates. Two decisions
need to be made: (1) whether to talk about a factor and (2) in which way
to refer to it. Instead of dealing with this task in a hierarchical way,
where the algorithm first decides whether to talk about a factor and
then decides how to refer to it, our proposed model treats both steps
jointly. The proposed method reduces the decision workload by deciding
either in which way to talk about a factor, or not to talk about a
factor at all.

2 Multi-label Classification 

Figure 2: Feedback generation as a binary classification problem with
history.

Classification is concerned with the identification of a category l from
a set of disjoint categories L (with |L| > 1) that an instance belongs
to, given the characteristics of the instance. If |L| = 2, then the
learning task is called binary classification, for example a task where
a classifier is trained to associate e-mails with either spam or not
spam (i.e. 1 or 0, and hence binary). If |L| > 2, then the learning task
is called multi-class classification, for example a task where the
classifier can label a running area as good, bad or ok. In multi-label
classification (MLC), the instances are associated with a set of labels
Y ⊆ L (Tsoumakas et al., 2010). For example, a newspaper article can be
classified into health, science, economy, politics, culture etc. A
specific news article concerning the breakthrough of the Ebola cure can
be classified into both of the categories health and science. In the
same way, students’ data can be assigned labels that describe them, i.e.
each label corresponds to a template. The set of chosen templates can
then form a feedback summary.
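To make the label-set view concrete, the sketch below encodes a set of labels as a binary indicator vector, a common input format for multi-label learners. The label names come from the newspaper example; the helper names `to_indicator` and `from_indicator` are hypothetical.

```python
# Label inventory L from the newspaper example in the text.
LABELS = ["health", "science", "economy", "politics", "culture"]


def to_indicator(label_set, labels=LABELS):
    """Encode a label set Y (a subset of L) as a 0/1 vector ordered like `labels`."""
    unknown = set(label_set) - set(labels)
    if unknown:
        raise ValueError(f"labels not in L: {sorted(unknown)}")
    return [1 if label in label_set else 0 for label in labels]


def from_indicator(vector, labels=LABELS):
    """Decode a 0/1 vector back into the set of labels it switches on."""
    return {label for label, bit in zip(labels, vector) if bit}


# The Ebola article belongs to two categories at once.
ebola_article = {"health", "science"}
vector = to_indicator(ebola_article)
print(vector)  # [1, 1, 0, 0, 0]
```

In the feedback setting, each of the 29 templates would occupy one position in such a vector, and the positions set to 1 would make up the summary.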

One set of factor values can result in various sets of templates as
interpreted by the different experts, i.e. a single student can receive
different feedback from different lecturers. A multi-label classifier is
able to make decisions for all templates simultaneously and capture
these differences. The RAndom k-labELsets (RAkEL) method (Tsoumakas et
al., 2010) is proposed for tackling content selection. RAkEL is based on
Label Powerset (LP), a problem transformation method, and uses ensembles
of classifiers. LP benefits from taking label correlations into
consideration, but does not perform well when trained with few examples
(Tsoumakas et al., 2010), as in our case (37 instances). RAkEL overcomes
this limitation by constructing a set of LP classifiers, which are
trained with different random subsets of the set of labels.
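The idea of an ensemble of LP classifiers over random label subsets can be sketched as follows. This is a simplified illustration, not the authors' implementation: the 1-nearest-neighbour base learner, the function names, the toy data, and the values of `k`, `n_models` and the 0.5 voting threshold are all assumptions chosen to keep the example small.

```python
import random


def nearest_neighbour(train_x, train_y, x):
    """Toy base learner (an assumption): class of the closest training point."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[best]


def train_rakel(X, Y, n_labels, k=2, n_models=6, seed=0):
    """Build one Label Powerset (LP) model per random k-labelset.

    X is a list of feature vectors, Y a list of sets of label indices.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        labelset = tuple(sorted(rng.sample(range(n_labels), k)))
        # LP transformation: each distinct label combination, restricted to
        # this labelset, becomes one class of a single-label problem.
        classes = [frozenset(y & set(labelset)) for y in Y]
        models.append((labelset, X, classes))
    return models


def predict_rakel(models, x, n_labels, threshold=0.5):
    """Each LP model votes on the labels in its labelset; keep majority-voted labels."""
    votes = [0] * n_labels
    counts = [0] * n_labels
    for labelset, train_x, classes in models:
        predicted = nearest_neighbour(train_x, classes, x)
        for label in labelset:
            counts[label] += 1
            votes[label] += label in predicted
    return {
        label for label in range(n_labels)
        if counts[label] and votes[label] / counts[label] > threshold
    }


# Invented toy data: 4 instances, 4 labels.
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
Y = [{0, 1}, {0, 1}, {2}, {2, 3}]
models = train_rakel(X, Y, n_labels=4)
print(predict_rakel(models, [0.0, 0.1], n_labels=4))
```

Because each LP model sees only k labels, it faces far fewer label combinations than a single LP classifier over all labels, which is the property the text relies on for small training sets. A label that no sampled labelset covers can never be predicted in this sketch.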


3 Evaluation 

We compare our approach to four meaningful baselines. DT (Decision
Trees) (no history): 29 classifiers were trained, one for each template.
No history is taken into account (see Figure 1). DT (with predicted
history): 29 classifiers were also trained, but this time the input
included the decisions made by the previous classifiers (i.e. the
history) as well as the set of time-series data, in order to emulate the
dependencies in the dataset (see Figure 2). Majority-class: labels each
instance with the most frequent template. DT (with real history): a
modification of DT (with predicted history) in which the real, expert
values were used in the model for history rather than the predicted
ones.

Classifier                    Accuracy (10-fold)  Precision  Recall  F-score
DT (no history)               *75.95%             67.56      75.96   67.87
DT (with predicted history)   **73.43%            65.49      72.05   70.95
Majority-class                **72.02%            61.73      77.37   68.21
MLC - RAkEL (no history)      76.95%              85.08      85.94   85.50
DT (with real history)        **78.09%            74.51      78.11   75.54

Table 1: Accuracy, precision, recall and F-score of the different
classification methods (t-test; * denotes significance with p<0.05 and
** significance with p<0.01, when comparing each result to RAkEL).
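The history mechanism of the DT (with predicted history) baseline can be sketched as a chain in which each classifier sees the input features plus all earlier decisions. The two hand-written rules below are hypothetical stand-ins for trained decision trees, and the function and feature layout are assumptions for illustration only.

```python
def predict_with_history(features, classifiers):
    """Run per-template classifiers in order, feeding earlier decisions forward.

    Each classifier is a callable taking (features + history) and returning 0 or 1.
    """
    history = []
    for classify in classifiers:
        decision = classify(list(features) + history)
        history.append(decision)
    return history


# Hypothetical stand-ins for trained decision trees over two input features.
def template_a(inputs):
    # Fires when the first feature is high.
    return 1 if inputs[0] > 0.5 else 0


def template_b(inputs):
    # Fires only if template_a did not (history slot at index 2).
    return 1 if inputs[1] > 0.5 and inputs[2] == 0 else 0


print(predict_with_history([0.9, 0.8], [template_a, template_b]))  # [1, 0]
print(predict_with_history([0.2, 0.8], [template_a, template_b]))  # [0, 1]
```

The real-history variant would pass the expert-chosen decisions in place of `history`, so that errors made early in the chain cannot propagate.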


4 Results and Conclusions 

MLC - RAkEL (Gkatzia et al., 2014) achieves higher accuracy, precision,
recall and F-score compared to (1) DT (no history), where each template
is predicted by a separate classifier independently, (2) DT (with
predicted history), where the decision for the previous template is
taken into account in the next decision, similar to Classifier Chains,
and (3) a Majority-class baseline.
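For readers unfamiliar with these measures in a multi-label setting, the sketch below computes micro-averaged precision, recall and F-score over predicted and gold label sets. The source does not state which averaging scheme was used, so micro-averaging is an assumption here, and the template names are invented.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F-score over lists of label sets."""
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Invented example: two instances with hypothetical template labels.
gold = [{"t1", "t2"}, {"t3"}]
predicted = [{"t1"}, {"t3", "t4"}]
precision, recall, f_score = micro_prf(gold, predicted)
print(precision, recall, f_score)
```

In this toy case two of the three predicted labels are correct and two of the three gold labels are recovered, so all three values equal 2/3.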

This method is powerful due to its ability to take into account data
correlations (Gkatzia et al., 2014). Multi-label classification should
be used when the data to be summarised need to be considered
simultaneously and/or when there are limited data available. For
example, in student feedback generation, the lectures a student attended
are highly correlated with his/her understandability (r = 0.6).
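Assuming r denotes Pearson's correlation coefficient (the text does not say which coefficient was used), such a value could be computed as in the sketch below. The two series are invented toy values, not the study's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented weekly values for two learning factors.
lectures_attended = [1, 2, 3, 4, 5]
understandability = [2, 1, 4, 3, 5]
print(round(pearson_r(lectures_attended, understandability), 2))  # 0.8
```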